A New Coefficient of Correlation. What if you were told there exists a… | by Tim Sumner | Mar, 2024

Before presenting the formula, some preparation is needed. As mentioned earlier, correlation can be viewed as a way of measuring the relationship between two variables. Suppose we are measuring the correlation between X and Y. A linear relationship is mutual, so the traditional correlation between X and Y is always the same as the correlation between Y and X. With this new approach, however, we are no longer measuring the linear relationship between X and Y; instead, our goal is to determine how much Y depends on X. Understanding this subtle but significant difference from traditional correlation techniques makes the formulas much easier to follow, because ξ(X,Y) does not always equal ξ(Y,X).

Continuing with the same line of thought, suppose we want to measure how much Y depends on X. Each data point is an ordered pair (X, Y). First, we arrange the data as (X₁,Y₁),…,(Xₙ,Yₙ) so that X₁ ≤ X₂ ≤ … ≤ Xₙ. In other words, we sort the data by X. Next, we define the variables r₁, r₂, …, rₙ, where rᵢ is the rank of Yᵢ, that is, the number of j such that Yⱼ ≤ Yᵢ. With these ranks determined, we are ready to compute.
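The preparation step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `sort_by_x_and_rank_y` and the `seed` parameter are assumptions introduced here, and ties in X are broken at random, which is one reasonable reading of the sorting step.

```python
import random


def sort_by_x_and_rank_y(x, y, seed=0):
    """Sort the pairs by x, then return the sorted y values and their ranks.

    The rank r_i is computed as the number of j with y_j <= y_i.
    Ties in x are broken at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # The random second component of the key only matters when x values tie.
    pairs = sorted(zip(x, y), key=lambda p: (p[0], rng.random()))
    y_sorted = [p[1] for p in pairs]
    # O(n^2) counting keeps the definition of r_i explicit.
    ranks = [sum(1 for v in y_sorted if v <= yi) for yi in y_sorted]
    return y_sorted, ranks


y_sorted, ranks = sort_by_x_and_rank_y([3, 1, 2], [30, 10, 20])
print(y_sorted, ranks)
```

The quadratic counting is chosen for readability; for large samples a sort-based ranking would be faster.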

Two formulas are used, depending on the type of data being handled. If ties in the data are impossible (or highly improbable), we have

ξₙ(X,Y) = 1 − 3·Σ|rᵢ₊₁ − rᵢ| / (n² − 1), with the sum taken over i = 1, …, n − 1,

and if ties are allowed, we have

ξₙ(X,Y) = 1 − n·Σ|rᵢ₊₁ − rᵢ| / (2·Σ lᵢ(n − lᵢ)), with the first sum taken over i = 1, …, n − 1 and the second over i = 1, …, n,

where lᵢ is defined as the number of j such that Yⱼ ≥ Yᵢ. It is essential to break observed ties at random when sorting, to ensure the best estimate possible and to avoid (rᵢ₊₁ − rᵢ) being zero. The variable lᵢ then counts the observations whose Y value is greater than or equal to Yᵢ.
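As a rough sketch of how the two formulas could be computed, the function below evaluates 1 − 3·Σ|rᵢ₊₁ − rᵢ| / (n² − 1) when ties are ruled out and 1 − n·Σ|rᵢ₊₁ − rᵢ| / (2·Σ lᵢ(n − lᵢ)) otherwise. The name `xi_correlation` and the `ties` and `seed` parameters are assumptions made for this illustration, and ties in X are broken at random during sorting.

```python
import random


def xi_correlation(x, y, ties=True, seed=0):
    """Sketch of the xi coefficient measuring how much y depends on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rng = random.Random(seed)
    # Sort indices by x; the random key component breaks ties in x.
    order = sorted(range(n), key=lambda i: (x[i], rng.random()))
    y_sorted = [y[i] for i in order]
    # r_i: number of j with y_j <= y_i.
    r = [sum(1 for v in y_sorted if v <= yi) for yi in y_sorted]
    diff_sum = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    if not ties:
        return 1 - 3 * diff_sum / (n ** 2 - 1)
    # l_i: number of j with y_j >= y_i.
    l = [sum(1 for v in y_sorted if v >= yi) for yi in y_sorted]
    denom = 2 * sum(li * (n - li) for li in l)
    if denom == 0:
        raise ValueError("y is constant, so the coefficient is undefined")
    return 1 - n * diff_sum / denom


xs = list(range(-100, 101))
ys = [v * v for v in xs]
print(xi_correlation(xs, ys))  # how much y = x^2 depends on x
print(xi_correlation(ys, xs))  # how much x depends on y = x^2
```

The usage lines illustrate the asymmetry discussed earlier: Y = X² is fully determined by X, so the first value is close to 1, while X is not determined by Y, so the second value is much smaller.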

Without delving too deeply into theory, it is worth mentioning that this new correlation is supported by useful asymptotic theory, which makes hypothesis testing straightforward without assumptions about the underlying distributions. This is because the method relies on the ranks of the data rather than the values themselves, making it a nonparametric statistic. If X and Y are independent and Y is continuous, then

√n · ξₙ(X,Y) → N(0, 2/5) in distribution as n → ∞.

This implies that with a sufficiently large sample size, this correlation statistic approximately follows a normal distribution, which can be beneficial for testing the independence between the variables under examination.
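One way such a test might look in practice is sketched below. It assumes that, under independence and with continuous Y, √n·ξₙ is approximately N(0, 2/5) for large n, and it uses a one-sided test because large positive values of ξₙ are what indicate dependence. The function name `xi_independence_pvalue` is hypothetical.

```python
import math
from statistics import NormalDist


def xi_independence_pvalue(xi, n):
    """Approximate one-sided p-value for H0: X and Y are independent.

    Assumes sqrt(n) * xi is approximately N(0, 2/5) under H0, which
    requires a large sample and continuous Y (no ties).
    """
    z = math.sqrt(n) * xi / math.sqrt(2 / 5)
    # Upper tail: only large positive xi counts as evidence of dependence.
    return 1 - NormalDist().cdf(z)


print(xi_independence_pvalue(0.3, 100))
```

A small p-value suggests rejecting independence; for small samples or heavily tied data, a permutation test would be a safer alternative to this normal approximation.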
